Looks like there are some null values in the data frame. Before treating outliers, we need to know how influential those points are with respect to the rest of the data. We can check that through the covariance each quantity has with the other quantities in the data.
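As a minimal sketch of both checks, assuming a pandas data frame (the column names below are placeholders, not the notebook's actual load step):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the vehicle data frame; the real columns
# and file path come from the notebook's own loading cell.
df = pd.DataFrame({
    "compactness":  [95.0, 91.0, np.nan, 104.0, 93.0],
    "circularity":  [48.0, 41.0, 50.0, 54.0, np.nan],
    "radius_ratio": [178.0, 161.0, 159.0, 207.0, 144.0],
})

# Count missing values per column.
null_counts = df.isnull().sum()

# Pairwise covariance between the numeric quantities
# (pandas computes it over the non-null pairs).
cov_matrix = df.cov()
```

`null_counts` flags which columns need treatment, while `cov_matrix` gives a first view of how the quantities move together.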

But there are no two peaks that differ significantly, suggesting there is not much separation in radius ratio, although two small peaks are still visible.

Note here how radius ratio is where the clear divisions appear.

Observations: The differences across features between bus and van are small, but the differences between car and bus are significant, and they are not all in one direction: some features increase while others decrease. This becomes clearer when we look at the details:

  1. The maximum scaled variance differs significantly across all three vehicle types, meaning that a bigger vehicle like a bus is likely to have a larger scaled variance than much smaller vehicles.
  2. So, during classification, it is most likely the decisive variable that dictates whether a given vehicle is a car, a van or a bus.
  3. There are roughly as many cars as there are buses and vans combined.
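The class-balance claim in point 3 can be checked directly with `value_counts` (the series below is a toy stand-in for the notebook's actual class column):

```python
import pandas as pd

# Hypothetical class labels; the real notebook reads these
# from the vehicle data frame's class column.
classes = pd.Series(["car", "car", "car", "bus", "van", "bus"])

counts = classes.value_counts()

# Compare the number of cars against buses + vans combined.
cars_vs_rest = counts["car"] == counts["bus"] + counts["van"]
```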

Note: We can see that elongatedness has a negative correlation with many of the other columns and a positive correlation with only a few. It remains to be seen how important those features are for predicting the class of the vehicle.
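One way to tally those signs is to pull the elongatedness row out of the correlation matrix. The frame below is synthetic, built so that most columns move opposite to elongatedness (all names here are assumptions, not the dataset's real values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Toy frame mimicking the pattern described above: two columns
# anti-correlated with elongatedness, one positively correlated.
df = pd.DataFrame({
    "elongatedness": base,
    "circularity":  -base + rng.normal(scale=0.1, size=100),
    "compactness":  -base + rng.normal(scale=0.1, size=100),
    "hollows_ratio": base + rng.normal(scale=0.1, size=100),
})

# Correlation of every other column with elongatedness.
corr_with_elong = df.corr()["elongatedness"].drop("elongatedness")
n_negative = int((corr_with_elong < 0).sum())
n_positive = int((corr_with_elong > 0).sum())
```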

We can simply drop those rows from the table and move on.
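In pandas that is a one-liner with `dropna` (shown on a toy frame, since the real one lives in the notebook's earlier cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [5.0, 6.0, np.nan, 8.0],
})

# Drop every row that contains at least one null, then re-index.
clean = df.dropna().reset_index(drop=True)
```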

Observations: The covariance picture here is an absolute disaster. No two components seem to move together, except one: scaled_variance.1. That might turn out to be THE most important feature for predicting the class of the vehicle. But let us first transform the data and check whether that changes.

Looks like adding more components beyond a point makes almost no difference to the explained variance ratio. So, we fix the number of components at just 7.

This is a huge amount of explained variance obtained from just a handful of variables. Let us stick to these 7 components for now.
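A sketch of fixing the component count and reading off the explained variance, on synthetic data (the 18-feature shape is an assumption about the vehicle silhouette data, and this is not the notebook's actual PCA cell):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the numeric vehicle features.
X = rng.normal(size=(200, 18))
# Make one pair of columns strongly correlated so PCA has structure to find.
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)

# Fix the number of components at 7 and inspect the variance captured.
pca = PCA(n_components=7)
X_pca = pca.fit_transform(X_scaled)
total_explained = pca.explained_variance_ratio_.sum()
```

`explained_variance_ratio_` is the per-component fraction of total variance; its running sum is what flattens out and justifies stopping at 7.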

Observation: Note here that almost all the principal components seem to cluster around a single mean point. But one component, component 2, is completely spread out and skewed to one side against almost all the variables. Most probably this trend is due to outliers, as is visible in the plots. All the others have promptly fallen into line.

Note here: The correlation and covariance show no change in the data after transformation, just as expected. Now we can go ahead with fitting a Support Vector Machine. After that we can set things up so that the inverse transformation can be applied to the output, as a check on the data.

Now let us check what difference it makes if we skip PCA but keep the very same configuration.

Observation: The accuracy of the model definitely improved a lot. The PCA-based model seems to be performing well with respect to both the test score and the train score.
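The with/without-PCA comparison can be sketched with two pipelines that differ only in the PCA step (synthetic 3-class data standing in for the vehicle set; the scores here are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the vehicle data.
X, y = make_classification(n_samples=600, n_features=18, n_informative=7,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same SVM configuration, with and without the PCA step.
svm_raw = make_pipeline(StandardScaler(), SVC())
svm_pca = make_pipeline(StandardScaler(), PCA(n_components=7), SVC())

svm_raw.fit(X_tr, y_tr)
svm_pca.fit(X_tr, y_tr)

scores = {
    "raw": (svm_raw.score(X_tr, y_tr), svm_raw.score(X_te, y_te)),
    "pca": (svm_pca.score(X_tr, y_tr), svm_pca.score(X_te, y_te)),
}
```

Putting both models in pipelines guarantees the scaling and PCA are fit on the training split only, so the train/test comparison is fair.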

Notes and Observations:

There is an obvious improvement in both the test score and the train score, visible in the two cells above. As the number of data points changes, the scores of the models will most probably change as well. There is approximately a 10% performance boost due to dimensionality reduction.